JCO Clinical Cancer Informatics
● American Society of Clinical Oncology (ASCO)
All preprints, ranked by how well they match JCO Clinical Cancer Informatics's content profile, based on 14 papers previously published here. The average preprint has a 0.13% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Miao, B. Y.; Rodriguez Almaraz, E.; Ashraf Ganjouei, A.; Suresh, A.; Zack, T.; Bravo, M.; Raghavendran, S.; Oskotsky, B.; Alaa, A.; Butte, A. J.
Background: Molecular biomarkers play a pivotal role in the diagnosis and treatment of oncologic diseases, but staying updated with the latest guidelines and research can be challenging for healthcare professionals and patients. Large Language Models (LLMs), such as MedPalm-2 and GPT-4, have emerged as potential tools to streamline biomedical information extraction, but their ability to summarize molecular biomarkers for oncologic disease subtyping remains unclear. Auto-generation of clinical nomograms from text guidelines could illustrate a new type of utility for LLMs. Methods: In this cross-sectional study, two LLMs, GPT-4 and Claude-2, were assessed for their ability to generate decision trees for molecular subtyping of oncologic diseases with and without expert-curated guidelines. Clinical evaluators assessed the accuracy of biomarker and cancer subtype generation, as well as the validity of molecular subtyping decision trees across five cancer types: colorectal cancer, invasive ductal carcinoma, acute myeloid leukemia, diffuse large B-cell lymphoma, and diffuse glioma. Results: Both GPT-4 and Claude-2 "off the shelf" successfully produced clinical decision trees that contained valid instances of biomarkers and disease subtypes. Overall, GPT-4 and Claude-2 showed limited improvement in the accuracy of decision tree generation when guideline text was added. A Streamlit dashboard was developed for interactive exploration of subtyping trees generated for other oncologic diseases. Conclusion: This study demonstrates the potential of LLMs like GPT-4 and Claude-2 in aiding the summarization of molecular diagnostic guidelines in oncology. While effective in certain aspects, their performance highlights the need for careful interpretation, especially in zero-shot settings. Future research should focus on enhancing these models for more nuanced and probabilistic interpretations in clinical decision-making.
The developed tools and methodologies present a promising avenue for expanding LLM applications in various medical specialties.
Key Points:
- Large language models, such as GPT-4 and Claude-2, can generate clinical decision trees that summarize best-practice guidelines in oncology
- Providing guidelines in the prompt query improves the accuracy of oncology biomarker and cancer subtype information extraction
- However, providing guidelines in zero-shot settings does not significantly improve generation of clinical decision trees for either GPT-4 or Claude-2
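As a sketch of how such generated subtyping trees might be checked automatically, the snippet below validates a nested-JSON decision tree against a curated biomarker list. The tree schema, function names, and the glioma example are illustrative assumptions, not the study's actual output format or evaluation procedure.

```python
# Hypothetical curated list; a real validator would load this from guidelines.
VALID_BIOMARKERS = {"IDH1", "IDH2", "1p/19q codeletion", "MGMT methylation"}

def collect_biomarkers(node):
    """Recursively gather every biomarker mentioned in a nested-dict tree."""
    found = set()
    if "biomarker" in node:
        found.add(node["biomarker"])
    for child in node.get("children", []):
        found |= collect_biomarkers(child)
    return found

def invalid_biomarkers(tree):
    """Return biomarkers in the tree that are absent from the curated list."""
    return collect_biomarkers(tree) - VALID_BIOMARKERS

# Toy LLM output for a diffuse glioma subtyping tree (invented example).
glioma_tree = {
    "biomarker": "IDH1",
    "children": [
        {"biomarker": "1p/19q codeletion", "children": []},
        {"biomarker": "EGFRvIII", "children": []},  # not in the curated set
    ],
}
```

A clinical evaluator would still need to judge whether valid biomarkers appear at sensible branch points; this only flags out-of-vocabulary nodes.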
Salome, P.; Knoll, M.; Walz, D.; Cogno, N.; Dedeoglu, A. S.; Qi, A. L.; Isakoff, S. J.; Abdollahi, A.; Jimenez, R. B.; Bitterman, D. S.; Paganetti, H.; Chamseddine, I.
Introduction: Manual data extraction from unstructured clinical notes is labor-intensive and impractical for large-scale clinical and research operations. Existing automated approaches typically require large language models, dedicated computational infrastructure, and/or task-specific fine-tuning that depends on curated data. The objective of this study is to enable accurate extraction with smaller locally deployed models using a disease-site specific pipeline and prompt configuration that are optimized and reusable. Materials/Methods: We developed OncoRAG, a four-phase pipeline that (1) generates feature-specific search terms via ontology enrichment, (2) constructs a clinical knowledge graph from notes using biomedical named entity recognition, (3) retrieves relevant context using graph-diffusion reranking, and (4) extracts features via structured prompts. We ran OncoRAG using Microsoft Phi-3-medium-instruct (14B parameters), a midsize language model deployed locally via Ollama. The pipeline was applied to three cohorts: triple-negative breast cancer (TNBC; npatients=104, nfeatures=42; primary development), recurrent high-grade glioma (RiCi; npatients=191, nfeatures=19; cross-lingual validation in German), and MIMIC-IV (npatients=100, nfeatures=10; external testing). Downstream task utility was assessed by comparing survival models for 3-year progression-free survival built from automatically extracted versus manually curated features. Results: The pipeline achieved mean F1 scores of 0.80 +/- 0.07 (TNBC; npatients=44, nfeatures=42), 0.79 +/- 0.12 (RiCi; npatients=61, nfeatures=19), and 0.84 +/- 0.06 (MIMIC-IV; npatients=100, nfeatures=10) on test sets under the automatic configuration. Compared to direct LLM prompting and naive RAG baselines, OncoRAG improved the mean F1-score by 0.19 to 0.22 and 0.17 to 0.19, respectively. Manual configuration refinement further improved the F1-score to 0.83 (TNBC) and 0.81 (RiCi), with no change in MIMIC-IV. 
Extraction time averaged 1.7-1.9 seconds per feature with the 14B model. Substituting a smaller 3.8B model reduced extraction time by 57%, with a decrease in F1-score (0.03-0.10). For TNBC, the extraction time was reduced from approximately two weeks of manual abstraction to under 2.5 hours. In an exploratory survival analysis, models using automatically extracted features showed a comparable C-index to those with manual curation (0.77 vs 0.76; 12 events). Conclusions: OncoRAG, deployed locally using a mid-size language model, achieved accurate feature extraction from multilingual oncology notes without fine-tuning. It was validated against manual extraction for both retrieval accuracy and survival model development. This locally deployable approach, which requires no external data sharing, addresses a critical bottleneck in scalable oncology research.
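The exploratory survival comparison above reports a concordance index. As a self-contained illustration of that metric, here is a pure-Python Harrell's C-index for right-censored data; the function name and toy data are ours, not part of OncoRAG.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when the subject with the shorter time
    had an event (events[i] == 1); the pair is concordant when that
    subject also has the higher predicted risk. Risk ties count as 0.5.
    """
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable
```

With risks perfectly anti-ordered against survival times the index is 1.0; uninformative constant risks give 0.5, the chance level the reported 0.77 vs 0.76 should be read against.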
Xu, S.; Wang, Z.; Wang, H.; Ding, Z.; Zou, Y.; Cao, Y.
Online cancer peer-support communities generate large volumes of patient-authored and caregiver-authored text that may reflect distress, coping, and informational needs. Automated emotional tone classification could support scalable monitoring, but supervised modeling depends on label quality and may benefit from explicit context features. Using the Mental Health Insights: Vulnerable Cancer Survivors & Caregivers dataset, we compared five model families (TF-IDF Logistic Regression, Random Forest, LightGBM, GRU, and fine-tuned ALBERT) on a three-class target (Negative/Neutral/Positive) derived from four original categories. We introduced two extensions: (i) LLM-based annotation to generate parallel "AI labels" and (ii) token-based augmentation that prepends LLM-extracted structured variables (reporter role and cancer type) to the post text. Models were trained with a 60/20/20 stratified train/validation/test split, with hyperparameters selected on validation data only. Test performance was summarized using weighted F1 and macro one-vs-rest AUC with bootstrap confidence intervals, with paired comparisons based on McNemar tests and false discovery rate adjustment. The LLM annotator produced substantial redistribution in the four-class label space, shifting prevalence toward very negative relative to the original labels; the shift persisted but attenuated after collapsing to three classes. Across all model families, token augmentation improved held-out performance, with the largest gains for GRU and consistent improvements for ALBERT. Augmentation also reduced polarity-reversing errors (Negative ↔ Positive) for ALBERT, while adjacent errors (Negative ↔ Neutral) remained the dominant residual failure mode.
These results indicate that LLM-based supervision can introduce systematic measurement shifts that require auditing, yet LLM-extracted context incorporated via simple token augmentation provides a pragmatic, model-agnostic mechanism to improve downstream emotional tone classification for supportive oncology decision support. Author summary: We studied how to better monitor emotional tone in posts from online cancer peer-support communities, where patients and caregivers share experiences that may signal distress, coping, or unmet needs. Automated classification could help organizations and moderators identify when additional support may be needed, but these systems depend on the quality of the labels used for training and may miss clinical context. Using a public dataset of cancer survivor and caregiver posts, we trained and compared several machine-learning and deep-learning models to classify each post as negative, neutral, or positive. We tested two practical improvements. First, we used a large language model to generate an additional set of "AI labels" and examined how these differed from the original categories. Second, we extracted simple context information--whether the writer was a patient or caregiver and what cancer type was mentioned--and added this context to the text before model training. We found that adding context consistently improved performance across model types. However, the AI-generated labels shifted class distributions, indicating that automated labeling can introduce systematic changes that should be audited. Overall, simple context extraction can make emotional tone monitoring more accurate and useful for supportive oncology decision support.
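The token-based augmentation described above can be sketched in a few lines: structured variables are simply prepended to the post text so that any downstream classifier sees them as ordinary input tokens. The bracketed token format here is an assumption for illustration, not the paper's exact encoding.

```python
def augment_post(text, role=None, cancer_type=None):
    """Prepend LLM-extracted context as plain tokens before the post text.

    Missing context fields are silently skipped, so un-augmented posts
    pass through unchanged.
    """
    prefix = []
    if role:
        prefix.append(f"[ROLE={role}]")
    if cancer_type:
        prefix.append(f"[CANCER={cancer_type}]")
    return " ".join(prefix + [text])
```

Because the context lands in the text itself, the same augmentation works unchanged for TF-IDF pipelines, a GRU, or a fine-tuned ALBERT, which is what makes the mechanism model-agnostic.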
McInerney, S.; Gurku, H.; Balasubramanian, R.; Vikram, P.; Bhaskaran, S.; Sekaran, K.
Objectives: To evaluate the performance of SROTAS IQ, a custom fine-tuned large language model (LLM), in automating clinical trial eligibility screening for breast cancer patients using synthetic data. Methods: Ten breast cancer trials were selected across diverse treatment settings and molecular subtypes. Fifteen synthetic patient summaries per trial were generated, including realistic and enriched eligibility scenarios. Two independent oncologists assessed trial eligibility for each patient, establishing ground truth. SROTAS IQ LLM was evaluated against expert consensus using standard classification metrics. Time-to-verdict was measured to compare clinician effort with automated assessment. Results: SROTAS IQ demonstrated strong concordance with expert assessments, achieving 90% or greater accuracy in 5 of 10 trials. Across 150 patient-trial evaluations, the model correctly classified 88% of overall eligibility decisions. Performance was highest in trials with moderate complexity and fewer nested criteria, while more intricate protocols showed reduced accuracy. The LLM consistently delivered rapid assessments (<0.5 minutes per patient), with explainable outputs that aligned with clinical reasoning. These findings underscore the model's potential to support high-fidelity, scalable trial matching in oncology. Conclusion: SROTAS IQ offers a promising approach to automating clinical trial matching in oncology. Further real-world validation is needed to confirm generalisability and integration into clinical practice.
Jonnalagadda, P.; Obeng-Gyasi, S.; Stover, D. G.; Andersen, B. L.; Rahurkar, S.
Background: Many patients with triple-negative breast cancer (TNBC), particularly those who are older, Black, or insured by Medicaid, do not receive guideline-concordant treatment, despite its association with up to 4x higher survival. Early identification of patients at risk for rapid relapse may enable timely interventions and improve outcomes. This study applies machine learning (ML) to real-world data to predict risk of rapid relapse in TNBC. Methods: We trained various ML models (logistic regression, decision trees, random forests, XGBoost, naive Bayes, support vector machines) using National Cancer Database (NCDB) data and fine-tuned them using electronic health record (EHR) data from a cancer registry. Class imbalance was addressed using the synthetic minority oversampling technique (SMOTE). Model performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), receiver operating characteristic area under the curve (ROC AUC), accuracy, and F1 scores. Transfer learning, cross-validation, and threshold optimization were applied to enhance the ensemble model's performance on clinical data. Results: Initial models trained on NCDB data exhibited high NPV but low sensitivity and PPV. SMOTE and hyperparameter tuning produced modest improvements. External testing on EHR data from a cancer registry showed similar model performance. After applying transfer learning, cross-validation, and threshold optimization using the clinical data, the ensemble model achieved higher performance. The optimized ensemble model achieved a sensitivity of 0.87, specificity of 0.99, PPV of 0.90, NPV of 0.98, ROC AUC of 0.99, accuracy of 0.98, and F1-score of 0.88. This optimized model, leveraging readily available clinical data, demonstrated superior performance compared to initial NCDB-trained models and those reported in extant literature.
Conclusions: Transfer learning and threshold optimization effectively adapted ML models trained on NCDB data to an independent real-world clinical dataset from a single site, producing a high-performing model for predicting rapid relapse in TNBC. This model, potentially translatable to Fast Healthcare Interoperability Resources (FHIR)-compatible workflows, represents a promising tool for identifying patients at high risk. Future work should include prospective external validation, evaluation of integration into clinical workflows, and implementation studies to determine whether the model improves care processes such as timely patient navigation and treatment planning. Author Summary: In this study, we set out to understand which patients with triple-negative breast cancer might experience a rapid return of their disease. Many people with this aggressive form of cancer do not receive the treatments that are known to improve survival, especially patients who are older, Black, or insured through public programs. Being able to identify those at highest risk early in their care could help health teams provide timely support and ensure that patients receive the treatments they need. To do this, we used information from a large national cancer database to build computer-based models that learn from patterns in patient data. We then refined these models using real medical records from a cancer center to make sure they worked well in everyday clinical settings. After adjusting and improving the models, we developed a tool that can correctly identify most patients who are likely to have a rapid return of their cancer. Our hope is that this type of tool could eventually be built into routine care and help guide timely follow-up, support services, and treatment planning. More testing in real clinical environments will be important to understand how well the tool improves care and outcomes for patients.
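Threshold optimization of the kind this study applies can be sketched as a simple sweep over candidate cutoffs, keeping the one that maximizes F1 on held-out predictions. The helper names and toy data below are illustrative, not the study's code.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1-score of the binary predictions obtained by cutting probs at threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, probs):
    """Sweep every observed probability as a cutoff; keep the F1-maximizer."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probs, t))
```

In practice the sweep is run on validation data, not the test set, so that the chosen cutoff does not leak test information into the reported metrics.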
Adamson, B. J.; Waskom, M.; Blarre, A.; Kelly, J.; Krismer, K.; Nemeth, S.; Gipetti, J.; Ritten, J.; Harrison, K.; Ho, G.; Linzmayer, R.; Bansal, T.; Wilkinson, S.; Amster, G.; Estola, E.; Benedum, C. M.; Fidyk, E.; Estevez, M.; Shapiro, W.; Cohen, A. B.
Background: As artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI's ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHR) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability. Methods: We applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (eg, clinician notes, radiology reports, lab reports, etc.) to output a set of structured variables required for RWD analysis. This research used a nationwide electronic health record (EHR)-derived database. Models were selected based on performance. Variables curated with an approach using ML extraction are those where the value is determined solely based on an ML model (ie, not confirmed by abstraction), which identifies key information from visit notes and documents. These models do not predict future events or infer missing information. Results: We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with manually abstracted variables. These extraction methods resulted in research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates. Conclusions: NLP and ML enable the extraction of retrospective clinical data in EHR with speed and scalability to help researchers learn from the experience of every person with cancer.
Jun, H.; Tanaka, Y.; Johri, S.; Carvalho, F. L.; Jordan, A. C.; Labaki, C.; Nagy, M.; O'Meara, T. A.; Pappa, T.; Pimenta, E. M.; Saad, E.; Yang, D. D.; Gillani, R.; Tewari, A. K.; Reardon, B.; Van Allen, E. M.
The rapid expansion of molecularly informed therapies in oncology, coupled with evolving FDA regulatory approvals, poses a challenge for oncologists seeking to integrate precision cancer medicine into patient care. Large Language Models (LLMs) have demonstrated potential for clinical applications, but their reliance on general knowledge limits their ability to provide up-to-date and niche treatment recommendations. To address this challenge, we developed a RAG-LLM workflow augmented with the Molecular Oncology Almanac (MOAlmanac), a curated precision oncology knowledge resource, and evaluated this approach relative to alternative frameworks (i.e. LLM-only) in making biomarker-driven treatment recommendations using both unstructured and structured data. We evaluated performance across 234 therapy-biomarker relationships. Finally, we assessed real-world applicability of the workflow by testing it on actual queries from practicing oncologists. While LLM-only achieved 62-75% accuracy in biomarker-driven treatment recommendations, RAG-LLM achieved 79-91% accuracy with an unstructured database and 94-95% accuracy with a structured database. In addition to accuracy, structured context augmentation significantly increased precision (49% to 80%) and F1-score (57% to 84%) compared to unstructured data augmentation. In queries provided by practicing oncologists, RAG-LLM achieved 81-90% accuracy. These findings demonstrate that the RAG-LLM framework effectively delivers precise and reliable FDA-approved precision oncology therapy recommendations grounded in individualized clinical data, and highlight the importance of integrating a well-curated, structured knowledge base in this process. While our RAG-LLM approach significantly improved accuracy compared to standard LLMs, further efforts will enhance the generation of reliable responses for ambiguous or unsupported clinical scenarios.
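A deliberately simplified sketch of the retrieval step in such a RAG workflow: rank knowledge-base entries against the clinical query and prepend the top hits to the LLM prompt. Plain word overlap stands in for the actual retriever, and the knowledge-base entries are invented examples, not MOAlmanac content.

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query and return the top k.

    A toy stand-in for an embedding- or index-based retriever; real
    systems would also normalize gene and drug nomenclature.
    """
    q_terms = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_terms & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

# Invented, structured-style knowledge-base entries for illustration only.
knowledge_base = [
    "BRAF V600E melanoma: consider BRAF/MEK inhibition",
    "EGFR exon 19 deletion NSCLC: consider osimertinib",
    "HER2 amplification breast cancer: consider trastuzumab",
]
```

The retrieved entries would then be inserted into the prompt as grounding context, which is the mechanism the reported accuracy gains over LLM-only prompting rest on.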
Makani, A.
Medical oncology education faces a dual crisis: knowledge velocity that outpaces static curricula and large language model (LLM) risks--hallucination and automation bias--that threaten the fidelity of AI-assisted learning. We present Onco-Shikshak V7, an AI-native adaptive learning platform that addresses both challenges through a unified cognitive architecture grounded in learning science. The system replaces isolated educational modules with four authentic clinical workflows--Morning Report, Tumor Board, Clinic Day, and AI Textbook--each scaffolded by a nine-module pedagogy engine that integrates ACT-R activation dynamics (illness scripts), Item Response Theory (adaptive difficulty), the Free Spaced Repetition Scheduler (FSRS v4), Zone of Proximal Development (scaffolding), and metacognitive calibration training (Brier score). Six specialist AI agents--medical oncology, radiation oncology, surgical oncology, pathology, radiology, and oncology navigation--engage in multi-disciplinary deliberation with per-specialty retrieval-augmented generation (RAG) grounding across nine authoritative guideline sources including NCCN, ESMO, and ASTRO. The platform provides 18 clinical cases with decision trees across six cancer types, maps every interaction to 13 ACGME Hematology-Oncology milestones, and implements four closed-loop feedback mechanisms that connect session errors to targeted flashcards, weak domains to suggested cases, and all interactions to a persistent learner profile. Technical validation confirms algorithmic correctness across eight subsystems. To our knowledge, this is the first system to unify ACT-R, IRT, FSRS, ZPD, and metacognitive calibration in a single medical education platform. Formal learner evaluation via randomized controlled trial is planned.
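The metacognitive calibration training mentioned above is scored with the Brier score, which can be stated in a few lines. A minimal sketch, with function and variable names of our choosing:

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and actual correctness.

    confidences: learner's probabilities (0..1) that each answer is right.
    outcomes: 1 if the answer was actually right, else 0.
    0.0 is perfect calibration with perfect accuracy; a constant 0.5
    'coin flip' confidence earns 0.25 regardless of performance.
    """
    n = len(confidences)
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / n
```

Tracking this score over sessions is one way a platform can show a learner whether their self-assessed confidence is drifting away from their real accuracy.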
Chen, L.-C.; Zack, T.; Demirci, A.; Sushil, M.; Miao, B.; Kasap, C.; Butte, A. J.; Collisson, E.; Hong, J.
Purpose: We examined the effectiveness of proprietary and open Large Language Models (LLMs) in detecting disease presence, location, and treatment response in pancreatic cancer from radiology reports. Methods: We analyzed 203 deidentified radiology reports, manually annotated for disease status, location, and indeterminate nodules needing follow-up. Utilizing GPT-4, GPT-3.5-turbo, and open models like Gemma-7B and Llama3-8B, we employed strategies such as ablation and prompt engineering to boost accuracy. Discrepancies between human and model interpretations were reviewed by a secondary oncologist. Results: Among 164 pancreatic adenocarcinoma patients, GPT-4 showed the highest accuracy in inferring disease status, achieving 75.5% correctness (F1-micro). Open models Mistral-7B and Llama3-8B performed comparably, with accuracies of 68.6% and 61.4%, respectively. Mistral-7B excelled in deriving correct inferences from "Objective Findings" directly. Most tested models demonstrated proficiency in identifying disease-containing anatomical locations from a list of choices, with GPT-4 and Llama3-8B showing near parity in precision and recall for disease site identification. However, open models struggled with differentiating benign from malignant post-surgical changes, impacting their precision in identifying findings indeterminate for cancer. A secondary review occasionally favored GPT-3.5's interpretations, indicating the variability in human judgment. Conclusion: LLMs, especially GPT-4, are proficient in deriving oncological insights from radiology reports. Their performance is enhanced by effective summarization strategies, demonstrating their potential in clinical support and healthcare analytics. This study also underscores the possibility of zero-shot open model utility in environments where proprietary models are restricted. Finally, by providing a set of annotated radiology reports, this paper presents a valuable dataset for further LLM research in oncology.
Tripathi, A. G.; Waqas, A.; Schabath, M. B.; Yilmaz, Y.; Rasool, G.
HONeYBEE (Harmonized ONcologY Biomedical Embedding Encoder) is an open-source framework that integrates multimodal biomedical data for oncology applications. It processes clinical data (structured and unstructured), whole-slide images, radiology scans, and molecular profiles to generate unified patient-level embeddings using domain-specific foundation models and fusion strategies. These embeddings enable survival prediction, cancer-type classification, patient similarity retrieval, and cohort clustering. Evaluated on 11,400+ patients across 33 cancer types from The Cancer Genome Atlas (TCGA), clinical embeddings showed the strongest single-modality performance with 98.5% classification accuracy and 96.4% precision@10 in patient retrieval. They also achieved the highest survival prediction concordance indices across most cancer types. Multimodal fusion provided complementary benefits for specific cancers, improving overall survival prediction beyond clinical features alone. Comparative evaluation of four large language models revealed that general-purpose models like Qwen3 outperformed specialized medical models for clinical text representation, though task-specific fine-tuning improved performance on heterogeneous data such as pathology reports.
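The precision@10 figure above measures how many of a query patient's ten nearest embedding neighbors share the query's label (here, cancer type). A small pure-Python sketch, assuming cosine similarity over patient-level vectors; the toy vectors and labels are invented, not TCGA data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def precision_at_k(query_vec, query_label, others, k):
    """Fraction of the k nearest embeddings sharing the query's label.

    others: list of (embedding, label) pairs for the rest of the cohort.
    """
    ranked = sorted(others, key=lambda o: cosine(query_vec, o[0]), reverse=True)
    return sum(1 for vec, label in ranked[:k] if label == query_label) / k

# Toy 2-D patient embeddings labeled with hypothetical cancer types.
cohort = [
    ([0.9, 0.1], "LUAD"),
    ([0.8, 0.2], "LUAD"),
    ([0.1, 0.9], "BRCA"),
    ([0.0, 1.0], "BRCA"),
]
```

Averaging this quantity over all query patients yields the retrieval metric the framework reports.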
Lopez-Garcia, G.; Xu, D.; Luu, M.; Zheng, R.; Daskivich, T. J.; Gonzalez-Hernandez, G.
Effective risk communication is essential to shared decision-making in prostate cancer care. However, the quality of physician communication of key tradeoffs varies widely in real-world consultations. Manual evaluation of communication is labor-intensive and not scalable. We present a structured, rubric-based framework that uses large language models (LLMs) to automatically score the quality of risk communication in prostate cancer consultations. Using transcripts from 20 clinical visits, we curated and annotated 487 physician-spoken sentences that referenced five decision-making domains: cancer prognosis, life expectancy, and three treatment side effects (erectile dysfunction, incontinence, and irritative urinary symptoms). Each sentence was assigned a score from 0 to 5 based on the precision and patient-specificity of communicated risk, using a validated scoring rubric. We modeled this task as five multiclass classification problems and evaluated both fine-tuned transformer baselines and GPT-4o with rubric-based and chain-of-thought (CoT) prompting. Our best performing approach, which combined rubric-based CoT prompting with few-shot learning, achieved micro-averaged F1 scores between 85.0 and 92.0 across domains, outperforming supervised baselines and matching inter-annotator agreement. These findings establish a scalable foundation for AI-driven evaluation of physician-patient communication in oncology and beyond.
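Micro-averaged F1, the headline metric here, pools true positives, false positives, and false negatives across all classes before computing a single F1. A minimal plain-Python sketch (toy labels, not the study's data):

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then score once.

    For single-label multiclass problems this reduces to plain accuracy,
    but the pooled form generalizes to multi-label settings.
    """
    tp = fp = fn = 0
    for c in labels:
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because pooling weights every sentence equally, rare rubric scores contribute less than they would under macro averaging, which is worth keeping in mind when comparing the per-domain figures.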
Das, R.; Maheswari, K.; Siddiqui, S.; Arora, N.; Paul, A.; Nanshi, J.; Udbalkar, V.; Sarvade, A.; Chaturvedi, H.; Shvartsman, T.; Masih, S.; Thippeswamy, R.; Patil, S.; Nirni, S. S.; Garsson, B.; Bandyopadhyay, S.; Maulik, U.; Farooq, M.; Sengupta, D.
The clinical adoption of Large Language Models (LLMs) in biomedical research has been limited by concerns regarding the quality, accuracy, and reliability of their outputs, particularly in precision oncology, where clinical decision-making demands high precision. Current models, often based on fine-tuned foundational LLMs, are prone to issues such as hallucinations, incoherent reasoning, and loss of context. In this work, we present GeneSilico Copilot, an advanced agent-based architecture that transforms LLMs from simple response synthesizers to clinical reasoning systems. Our approach is centred around a bespoke ReAct agent that orchestrates a suite of specialized tools for asynchronous information retrieval and synthesis. These tools access curated document vector stores containing clinical treatment guidelines, genomic insights, drug information, clinical trials, and breast cancer-specific literature. To leverage large context windows of current LLMs, we implement a hybrid search strategy that prioritizes key information and dynamically integrates summarized content, reducing context fragmentation. Incorporating additional metadata further allows for precise, transparent and evidence-backed reasoning at each step of the thought process. The system ensures that at every stage, the agent can synthesize meaningful, context-aware observations that contribute to a coherent and comprehensive final response that aligns with clinical standards. Evaluations on real-world breast cancer cases show that GeneSilico Copilot significantly improves response accuracy and personalization. This system represents a critical advancement toward making LLMs clinically deployable in precision oncology and has potential applications in broader medical domains requiring complex, data-driven decision-making.
Dennstaedt, F.; Bobnar, T.; Handra, A.; Putora, P. M.; Filchenko, I.; Brueningk, S.; Aebersold, D. M.; Cihoric, N.; Shelan, M.
Background: The growing volume of biomedical literature, especially in oncology, necessitates automated tools for extracting clinically relevant information. Large Language Models (LLMs) offer promising capabilities for data extraction in this domain. However, their potential to extract clinically relevant information from case reports detailing rare treatment interactions remains underexplored. Methods: We systematically searched PubMed for case reports on interactions between radiotherapy (RT) and Pembrolizumab, Cetuximab, or Cisplatin. A random sample of 100 report abstracts for each therapy was manually classified by two independent medical experts using 17 Boolean questions about patient demographics, treatment, cancer type, and outcome with mutually exclusive answers, forming a ground truth. An LLM-based system with the open-source GPT models (GPT-OSS-120B and GPT-OSS-20B) was applied to classify these reports and the remaining dataset entries using the defined question structure. Performance of the LLM-based information extraction was evaluated using the standard classification metrics accuracy, precision, recall, and F1-scores. Results: The systematic searches yielded 320 (Pembrolizumab), 147 (Cetuximab), and 2055 (Cisplatin) publications. Inter-rater agreement for manual classification was high (Cohen's kappa = 0.87), though lower (0.60-0.80) for specific outcome and cancer type questions. The LLM-based classification (GPT-OSS-120B model) achieved high overall performance with an F1-score of 94.33% (95.83% accuracy, 93.69% precision, 94.98% recall). Performance was consistent across systemic therapies, with the smaller GPT-OSS-20B model showing similar results (F1-score 94.06%). Analysis of the entire datasets revealed that 56.02% of publications described patients who received both RT and systemic therapy. Proportions of positive and negative outcomes varied by therapy and sequencing.
Conclusions: LLM-based classification systems demonstrate high accuracy and reliability for curating scientific case reports on RT and systemic therapy interactions. These findings support their potential for high-throughput hypothesis generation and knowledge base construction in oncology, particularly for underutilized case reports, with even smaller open-source models proving effective for such tasks.
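Inter-rater agreement in this study is reported as Cohen's kappa, which discounts the agreement two raters would reach by chance given their label frequencies. A compact sketch with toy Boolean answers (our own data, not the study's):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Chance agreement is the probability that two independent raters with
    these marginal label frequencies would coincide by accident.
    """
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in labels
    )
    return (observed - expected) / (1 - expected)
```

Kappa near 0.87, as reported for the overall manual classification, indicates near-perfect agreement; values of 0.60-0.80, as seen for the outcome questions, are conventionally read as substantial but noticeably noisier ground truth.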
Prelaj, A.; Miskovic, V.; Sacco, M.; Ferrarin, A.; Licciardello, C.; Provenzano, L.; Favali, M.; Lerma, L.; Zec, A.; Spagnoletti, A.; Ganzinelli, M.; Lorenzini, D.; Guirges, B.; Invernizzi, L.; Silvestri, C.; Mazzeo, L.; Meazza Prina, M.; Corrao, G.; Ruggirello, M.; Dumitrascu, A. D.; Di Mauro, R. M.; Monzani, D.; Pravettoni, G.; Zanitti, M.; Macocchi, D.; Marino, M.; Cavalli, C.; Romano, R.; Giani, C.; Armato, S. G.; Esposito, A.; Bestvina, C.; Spector, M.; Bogot, N. R.; Basheer, R.; Hafzadi, A. L.; Roisman, L.; Watermann, I.; Szewczyk, M.; Olchers, T.; Richter, H.; Blanke-Roeser, C.; Sinisca
Despite a decade of immunotherapy, treatment selection in non-small cell lung cancer (NSCLC) still relies on subgroup analyses and clinical scores. I3LUNG (NCT05537922) is currently the largest international, real-world, multimodal, artificial intelligence (AI)-based trial, enrolling 2365 patients. We integrated real-world clinical data (RWD), computed tomography (CT) images, digital pathology (DP), and genomics (G) into machine learning early-fusion (MLEF) and deep-learning intermediate-fusion (DLIF) models. MLEF achieved consistent performance across outcomes (AUC ≈ 0.74), with improved results in first-line patients (AUC up to 0.82). Multimodal models outperformed RWD in clinical-specific subgroups (AUCs up to 0.86). In the test set, AI models surpassed PD-L1, ECOG PS, NLR, LDH (all with p<0.01) and the LIPI score. The clinical usability study showed that expert and non-expert physicians could improve their prediction with the explainable AI (XAI) tool. The I3LUNG tool emerges as a clinically relevant decision-support system and is currently under prospective validation in >2,000 patients.
Ahmed, S.; Parker, N.; Park, M.; Davis, E. W.; Jeong, D.; Permuth, J. B.; Schabath, M. B.; Yilmaz, Y.; Rasool, G.
Cancer cachexia, a multifactorial metabolic syndrome characterized by severe muscle wasting and weight loss, contributes to poor outcomes across various cancer types but lacks a standardized, generalizable biomarker for early detection. We present a multimodal AI-based biomarker trained on real-world clinical, radiologic, laboratory, and unstructured clinical note data, leveraging foundation models and large language models (LLMs) to identify cachexia at the time of cancer diagnosis. Prediction accuracy improved with each added modality: 77% using clinical variables alone, 81% with added laboratory data, and 85% with structured symptom features extracted from clinical notes. Incorporating embeddings from clinical text and CT images further improved accuracy to 92%. The framework also demonstrated prognostic utility, improving survival prediction as data modalities were integrated. Designed for real-world clinical deployment, the framework accommodates missing modalities without requiring imputation or case exclusion, supporting scalability across diverse oncology settings. Unlike prior models trained on curated datasets, our approach utilizes standard-of-care clinical data, facilitating integration into oncology workflows. In contrast to fixed-threshold composite indices such as the cachexia index (CXI), the model generates patient-specific predictions, enabling adaptable, cancer-agnostic performance. To enhance clinical reliability and safety, the framework incorporates uncertainty estimation to flag low-confidence cases for expert review. This work advances a clinically applicable, scalable, and trustworthy AI-driven decision support tool for early cachexia detection and personalized oncology care.
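The abstract's claim that the framework "accommodates missing modalities without requiring imputation or case exclusion" can be illustrated with a minimal late-fusion sketch. All names, weights, and scores below are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of modality-tolerant fusion: each available modality
# contributes a risk score; missing modalities are simply skipped rather
# than imputed. Weights and scores are illustrative, not the paper's model.

def fuse_scores(modality_scores, weights):
    """Weighted average over whichever modalities are present (not None)."""
    present = {m: s for m, s in modality_scores.items() if s is not None}
    if not present:
        raise ValueError("no modality available")
    total_w = sum(weights[m] for m in present)
    return sum(weights[m] * s for m, s in present.items()) / total_w

weights = {"clinical": 1.0, "labs": 1.0, "notes": 1.5, "ct": 2.0}

# Full data vs. a patient missing note embeddings and CT:
full = fuse_scores({"clinical": 0.6, "labs": 0.7, "notes": 0.8, "ct": 0.9}, weights)
partial = fuse_scores({"clinical": 0.6, "labs": 0.7, "notes": None, "ct": None}, weights)
```

Because the normalizer `total_w` is recomputed per patient, the prediction degrades gracefully as modalities drop out instead of requiring imputation.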
Corso, F.; Peppoloni, V.; Mazzeo, L.; Leone, G.; Passos, L.; Miskovic, V.; Armanini, J.; Ferrarin, A.; Wiest, I. C.; Wolf, F.; Montelatici, G.; Romano', R.; Ambrosini, P.; Capoccia, T.; Natangelo, S.; Rota, S.; Andena, P.; De Ponti, M.; Russo, A.; Stasi, G.; Provenzano, L.; Spagnoletti, A.; Meazza Prina, M.; Cavalli, C.; Giani, C.; Serino, R.; Borraccino, M.; Bonalume, C.; Di Mauro, R. M.; Agosta, C.; Dumitrascu, A. D.; Di Liberti, G.; Corrao, G.; Beninato, T.; Ganzinelli, M.; Occhipinti, M.; Brambilla, M.; Proto, C.; Kather, J. N.; Pedrocchi, A. L. G.; De Braud, F.; Lo Russo, G.; Baili, P.; P
Real-world data (RWD), largely stored in unstructured electronic health records (EHRs), are critical for understanding complex diseases like cancer. However, extracting structured information from these narratives is challenging due to linguistic variability, semantic complexity, and privacy concerns. This study evaluates the performance of four locally deployable small language models (SLMs), LLaMA, Mistral, BioMistral, and MedLLaMA, for information extraction (IE) from Italian EHRs within the APOLLO 11 trial on non-small cell lung cancer (NSCLC). We examined three prompting strategies (zero-shot, few-shot, and annotated few-shot) across English and Italian, involving clinicians with varying expertise to assess the impact of prompt design on accuracy. Results show that general-purpose models (e.g., LLaMA 3.1 8B) outperform biomedical models in most tasks, particularly in extracting binary features. Multiclass variables such as TNM staging, PD-L1, and ECOG were more difficult due to implicit language and lack of standardization. Few-shot prompting and native-language inputs significantly improved performance and reduced hallucinations. Clinical expertise enhanced consistency in annotation, particularly among students using annotated examples. The study confirms that privacy-preserving SLMs can be deployed locally for efficient and secure cancer data extraction. Findings highlight the need for hybrid systems combining SLMs with expert input and underline the importance of aligning clinical documentation practices with SLM capabilities. This is the first study to benchmark SLMs on Italian EHRs and investigate the role of clinical expertise in prompt engineering, offering valuable insights for the future integration of SLMs into real-world clinical workflows.
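The few-shot prompting strategy described above can be sketched as a simple prompt builder for a binary feature. The wording, labels, and example notes are hypothetical; the study's actual prompts and annotation scheme are not reproduced here:

```python
# Schematic few-shot prompt for extracting a binary feature (smoking
# status) from a clinical note. Wording and labels are hypothetical.

EXAMPLES = [
    ("Patient is a former smoker, quit 10 years ago.", "yes"),
    ("No history of tobacco use reported.", "no"),
]

def build_prompt(note, examples=EXAMPLES):
    lines = ["Extract smoking history from the note. Answer 'yes' or 'no'."]
    for text, label in examples:
        lines.append(f"Note: {text}\nAnswer: {label}")
    lines.append(f"Note: {note}\nAnswer:")
    return "\n\n".join(lines)

# Native-language (Italian) input, as evaluated in the study:
prompt = build_prompt("Fumatore attivo, 20 sigarette al giorno.")
```

The resulting string would be sent to a locally deployed SLM; constraining the answer format in the instruction line is one common way to reduce hallucinated free text.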
Ankolekar, A.; Boie, S.; Abdollahyan, M.; Gadaleta, E.; Hasheminasab, S. A.; Yang, G.; Beauville, C.; Dikaios, N.; Kastis, G. A.; Bussmann, M.; Khalid, S.; Kruger, H.; Lambin, P.; Papanastasiou, G.
Federated Learning (FL) has emerged as a promising solution to address the limitations of centralised machine learning (ML) in oncology, particularly in overcoming privacy concerns and harnessing the power of diverse, multi-center data. This systematic review synthesises current knowledge on the state-of-the-art FL in oncology, focusing on breast, lung, and prostate cancer. Distinct from previous surveys, our comprehensive review critically evaluates the real-world implementation and impact of FL on cancer care, demonstrating its effectiveness in enhancing ML generalisability, performance and data privacy in clinical settings and data. We evaluated state-of-the-art advances in FL, demonstrating its growing adoption amid tightening data privacy regulations. FL outperformed centralised ML in 15 out of the 25 studies reviewed, spanning diverse ML models and clinical applications, and facilitating integration of multi-modal information for precision medicine. Despite the current challenges identified in reproducibility, standardisation and methodology across studies, the demonstrable benefits of FL in harnessing real-world data and addressing clinical needs highlight its significant potential for advancing cancer research. We propose that future research should focus on addressing these limitations and investigating further advanced FL methods, to fully harness data diversity and realise the transformative power of cutting-edge FL in cancer care.
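The aggregation step underlying most FL systems reviewed is federated averaging (FedAvg): each site trains locally, and the server combines updates weighted by local sample counts. A pure-Python toy sketch, not any reviewed system:

```python
# Toy FedAvg aggregation: average client weight vectors in proportion
# to each site's sample count. Illustrative only.

def fed_avg(client_updates):
    """client_updates: list of (n_samples, weight_vector) per site.
    Returns the sample-weighted average vector."""
    total = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    agg = [0.0] * dim
    for n, w in client_updates:
        for i, wi in enumerate(w):
            agg[i] += (n / total) * wi
    return agg

# Three hospitals with cohort sizes 100, 300, and 600:
global_w = fed_avg([(100, [1.0, 2.0]), (300, [2.0, 0.0]), (600, [0.0, 1.0])])
```

Only weight vectors, never patient records, leave each site, which is why FL sidesteps the privacy concerns of centralised ML noted above.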
Rouzier, R.; Harter, V.; Rouzier, E.; Ferment, V.; Gruau, S.; Andre, B.; Saumon Sud, C.; Nadin, L.; Corroyer Dulmont, A.; Vigneron, N.
Cancer staging plays a critical role in treatment planning and prognosis but is often embedded in unstructured clinical narratives. To automate the extraction and structuring of staging data, large language models (LLMs) have emerged as a promising approach. However, their performance in real-world oncology settings has yet to be systematically evaluated. Herein, we analysed 1000 oncological summaries from patients receiving treatment for breast cancer between 2019 and 2020 at the Francois Baclesse Comprehensive Cancer Centre, France. Five Mistral artificial intelligence-based LLMs were evaluated (i.e. Small, Medium, Large, Magistral and Mistral:latest) for their ability to derive the cancer stage and identify staging elements. Larger models outperformed their smaller counterparts in staging accuracy and reproducibility (kappa > 0.95 for Mistral Large and Medium). Mistral Large achieved the highest accuracy in deriving the cancer stage (93.0%), surpassing the original clinical documentation in several cases. The LLMs consistently derived the cancer stage more accurately when working through tumour size, nodal status and metastatic components than when asked for the stage directly. The top-performing models had a test-retest reliability exceeding 97%, while smaller models and locally deployed versions lacked sufficient robustness, particularly in handling unit conversions and complex staging rules. The structured, stepwise use of LLMs that emulates clinician reasoning offers a more efficient, transparent and reproducible approach to cancer staging, and the study findings support LLM integration into digital oncology workflows.
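The stepwise strategy the study found more reliable — derive the T, N, and M components first, then apply staging rules deterministically — can be mimicked with a lookup function. The rules below are a drastically simplified toy, NOT the AJCC breast cancer staging tables:

```python
# Toy anatomic-stage lookup from extracted (T, N, M) components,
# mirroring the stepwise approach the study found more reliable than
# asking the model for the stage directly. Grossly simplified rules.

def derive_stage(t, n, m):
    if m == "M1":
        return "IV"
    if t == "T1" and n == "N0":
        return "I"
    if n == "N0":
        return "II"
    return "III"

stage = derive_stage("T2", "N1", "M0")
```

Keeping the final stage assignment in deterministic code rather than in the model is one way to make the pipeline transparent and reproducible, as the abstract advocates: the LLM only extracts components, and the rule table is auditable.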
Chen, S.; Kann, B. H.; Foote, M. B.; Aerts, H. J.; Savova, G. K.; Mak, R. H.; Bitterman, D. S.
The use of large language models (LLMs) such as ChatGPT for medical question-answering is becoming increasingly popular. However, there are concerns that these models may generate and amplify medical misinformation. Because cancer patients frequently seek to educate themselves through online resources, some individuals will likely use ChatGPT to obtain cancer treatment information. This study evaluated the performance and robustness of ChatGPT in providing breast, prostate, and lung cancer treatment recommendations that align with National Comprehensive Cancer Network (NCCN) guidelines. Four prompt templates were created to explore how differences in query phrasing affect the response. ChatGPT output was scored by 3 oncologists and a 4th oncologist adjudicated in cases of disagreement. ChatGPT provided at least one NCCN-concordant recommendation for 102/104 (98%) prompts. However, 35/102 (34.3%) of these also included a recommendation that was at least partially non-concordant with NCCN guidelines. Responses varied based on prompt type. In conclusion, ChatGPT did not reliably and robustly provide cancer treatment recommendations. Patients and clinicians should be aware of the limitations of ChatGPT and similar technologies for self-education.
Vesteghem, C.; Dahl, S. C.; Broendum, R. F.; Soenderkaer, M.; Boedker, J. S.; Schmitz, A.; Weischenfeldt, J.; Pedersen, I. S.; Sommer, M.; Rytter, A. S.; Nielsen, M. M.; Ladekarl, M.; Severinsen, M. T.; Dybkaer, K.; Groenbaek, K.; El-Galaly, T.; Roug, A. S.; Boegsted, M.
Objectives: To facilitate clinical implementation and research in precision oncology, notably the pairing of patients, variants and treatments to identify candidates for clinical trials, we have built a data infrastructure to 1) capture and store data, 2) reduce manual tasks for clinical and genomic data collection and management, 3) combine data for quality controls, reporting and findability. Infrastructure: The infrastructure uses REDCap repositories to capture and store data. The structure of these repositories is customized for each project. Additionally, a cross-project web platform was developed using software development best practices and state-of-the-art web technologies to circumvent REDCap's limitations and integrate other third-party resources. Using REDCap's application programming interfaces, this platform allowed validation of data across multiple repositories, easy import of data from external sources, generation of overviews of included patients and available data, combination of genomic and clinical data to generate tumour board reports and the findability of data. Its design was driven by data stewardship best practices. Usage: Across four precision medicine projects, the infrastructure has been used to collect data for 1921 patients, including 453 genomic data files. The custom-built web platform made it possible to import, validate, and present data in a comprehensive manner. This included building tumour board reports for clinicians, combining clinical and genomic data, and search functionalities for researchers. Discussion: REDCap allowed us to capitalize on the numerous data capture and management features developed in this solution. Designing a cross-project platform guarantees long-term relevance where developments can be mutualised across projects and allowed us to make the overall solution more compliant with the FAIR (Findable, Accessible, Interoperable, Reusable) data principles.
Further developments should be considered, notably automatic retrieval of data from electronic health records to limit the number of manual tasks. Conclusion: The proposed infrastructure allowed our precision oncology projects to gain efficiency in data collection and increase data quality by reducing manual work, and it gave straightforward and customized access to data for researchers and clinicians.
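REDCap's record-export API, which such a platform builds on, takes an access token and a small set of documented parameters POSTed to the project's `/api/` endpoint. The helper below only assembles such a payload (stdlib only, no network call); the field and form names are illustrative, not those of the projects described:

```python
# Assemble a payload for REDCap's record-export API (POSTed to
# <redcap_url>/api/). Parameter names follow the documented API;
# the token and field values here are illustrative placeholders.

def build_record_export_payload(api_token, fields, forms=None):
    payload = {
        "token": api_token,       # project-level API token
        "content": "record",      # export records
        "format": "json",         # response format
        "type": "flat",           # one row per record/event
        "fields": ",".join(fields),
    }
    if forms:
        payload["forms"] = ",".join(forms)
    return payload

payload = build_record_export_payload("EXAMPLE_TOKEN", ["record_id", "diagnosis"])
```

In practice this dictionary would be sent with an HTTP client (or a wrapper such as PyCap), and the cross-project platform would validate the returned records against its other repositories before building tumour board reports.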